Considerable progress was made in developing artificial neural network methods for
solving  stochastic  sequential  decision  problems.  The  research  focused  on 
reinforcement  learning  methods  based  on  approximating  dynamic  programming  (DP). 

They used problems in the domains of robot fine motion control, navigation, and
steering  control  in  order  to  develop  and  test  learning  algorithms  and 
architectures.  Although  most  of  these  problems  were  simulated,  they  also  began  to 
apply  DP-based  learning  algorithms  to  actual  robot  control  problems  with 
considerable success. Progress was made on reinforcement learning methods using
continuous actions, modular network architectures, and architectures using abstract
actions. Theoretical progress was made in relating DP-based reinforcement learning
algorithms  to  more  conventional  methods  for  solving  stochastic  sequential  decision 
problems.  As  a  result  of  this  research  there  is  an  improved  understanding  of  these 
algorithms and how they can be successfully used in applications.

Summary: Considerable progress was made in developing artificial neural network methods
for solving stochastic sequential decision problems. Our research focused on reinforcement
learning methods based on approximating dynamic programming (DP). We used problems
in the domains of robot fine motion control, navigation, and steering control in order to
develop and test learning algorithms and architectures. Although most of these problems
were simulated, we also began to apply DP-based learning algorithms to actual robot control
problems with considerable success. Progress was made on reinforcement learning methods
using continuous actions, modular network architectures, and architectures using abstract
actions. Theoretical progress was made in relating DP-based reinforcement learning algorithms
to more conventional methods for solving stochastic sequential decision problems. As
a result of this research, we have a much improved understanding of these algorithms and
how they can be successfully used in applications.


1  Introduction 

Following  is  the  summary  of  the  research  proposal  that  led  to  funding  of  the  research 
being  reported  here.  It  states  the  research  objectives. 


This  project  seeks  to  develop  learning  methods  for  artificial  neural  networks 
(or connectionist networks) for application to problems formalized as stochastic
sequential decision problems. In these problems the consequences of network
actions  unfold  over  an  extended  time  period  after  an  action  is  taken,  so  that 
actions  must  be  selected  on  the  basis  of  both  their  short-term  and  long-term 
consequences  and  under  uncertainty.  Problems  of  this  kind  can  be  viewed  as 
discrete-time  stochastic  control  problems.  The  theory  of  stochastic  sequential 
decision making and the computational techniques associated with it, known as
stochastic dynamic programming, provide ways of understanding the capabilities
of the reinforcement-learning and temporal credit-assignment methods we
previously  developed  and  suggest  a  variety  of  extensions  to  them  which  can  be 
implemented  as  adaptive  networks.  These  extensions  involve  model-based  and 
hierarchical  learning.  The  long-term  goal  of  this  research  is  the  development 
of network methods for the efficient solution of stochastic sequential decision
problems in the absence of complete knowledge of underlying dynamics.


We made considerable progress in furthering the development of DP-based reinforcement
learning algorithms and in understanding their properties and domains of utility. Below we
describe our major accomplishments. Some aspects of this project were closely related to
research funded under National Science Foundation Grant ECS-8912623.


2  Reinforcement  Learning  of  Continuous  Values 

Part of our research addressed methods for allowing networks with continuous outputs to
learn via reinforcement learning. Although this work did not explicitly rely on the formalism
of sequential decision problems, it addressed a capability that learning systems must have for
a  wide  range  of  such  problems.  Whereas  most  reinforcement  learning  systems  are  restricted  to 
a  finite  set  of  actions,  many  sequential  decision  problems  require  learning  over  a  continuous 
range  of  actions.  Our  effort  focused  on  Stochastic  Real-Valued  (SRV)  units,  which  are 
neuron-like  units  with  real-valued  outputs  that  can  be  trained  via  reinforcement  feedback. 
SRV  units  were  developed  by  V.  Gullapalli  with  support  from  this  grant  and  formed  the 
basis of his Ph.D. dissertation (he received the Ph.D. in 1992). We conducted a number of
experiments  using  SRV  units  in  a  simulated  pole-balancing  task  and  control  of  a  simulated 
three  degree-of-freedom  robot  arm  in  an  underconstrained  positioning  task.  Results  indicated 
that  networks  using  SRV  units  can  learn  these  tasks  faster  than  networks  based  on  supervised 
learning. Gullapalli has published a journal article, several conference papers, and a book
chapter  on  this  work. 
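
The idea of a unit with a real-valued output trained only by a scalar reinforcement signal can be sketched as follows. This is a simplified illustration, not the SRV algorithm as published: the unit draws its output from a Gaussian, shifts the mean toward outputs that earned more reinforcement than predicted, and narrows its exploration as the predicted reinforcement rises. The class name `SRVUnit`, the learning rates, the rule used for the spread, and the one-dimensional target task are all assumptions made for the example.

```python
import random


class SRVUnit:
    """Toy stochastic real-valued unit (illustrative; not the published algorithm)."""

    def __init__(self, n_inputs, lr_mean=0.1, lr_pred=0.1, seed=0):
        self.w = [0.0] * n_inputs  # weights producing the mean of the output
        self.v = [0.0] * n_inputs  # weights producing the predicted reinforcement
        self.lr_mean = lr_mean
        self.lr_pred = lr_pred
        self.rng = random.Random(seed)

    def act(self, x):
        """Draw a real-valued output; return it with the quantities learn() needs."""
        mean = sum(wi * xi for wi, xi in zip(self.w, x))
        predicted = sum(vi * xi for vi, xi in zip(self.v, x))
        # Assumed rule: explore widely when little reinforcement is expected.
        spread = min(1.0, max(0.05, 1.0 - predicted))
        output = self.rng.gauss(mean, spread)
        return output, mean, spread, predicted

    def learn(self, x, output, mean, spread, predicted, reinforcement):
        """Shift the mean toward outputs that did better than predicted."""
        surprise = reinforcement - predicted
        direction = (output - mean) / spread
        for i, xi in enumerate(x):
            self.w[i] += self.lr_mean * surprise * direction * xi
            self.v[i] += self.lr_pred * surprise * xi


def train_on_target(target=0.7, trials=3000, seed=0):
    """Hypothetical task: reinforcement is 1 minus the distance to a hidden target."""
    unit = SRVUnit(n_inputs=1, seed=seed)
    x = [1.0]  # constant input, so the unit learns a single output value
    for _ in range(trials):
        output, mean, spread, predicted = unit.act(x)
        reinforcement = 1.0 - abs(output - target)
        unit.learn(x, output, mean, spread, predicted, reinforcement)
    return unit


if __name__ == "__main__":
    unit = train_on_target()
    print("learned mean output:", round(unit.w[0], 3))
```

The unit is never told the target value or the sign of its error; it only sees how much reinforcement each sampled output earned, which is the distinction from supervised learning drawn above.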

Gullapalli  also  used  SRV  units  in  a  neural  network  model  of  perception  by  training 
a  network  with  SRV  units  to  model  area  7a  of  the  posterior  parietal  cortex,  a  cortical 
area  thought  to  transform  visual  stimuli  from  retinotopic  coordinates  into  a  head-centered 
coordinate system [5]. Results showed that the SRV network reproduces the performance
of  previous  models  while  being  free  of  some  of  their  limitations  with  respect  to  biological 
plausibility. 

Based on the promise shown by these simulations, we applied a network using SRV
units to the problem of robot peg-in-hole insertion using a robot arm (a Zebra Zero). We
achieved very promising results, described in refs. [7; 6]. This task is important in industrial
robotics and is widely used by roboticists for testing approaches to robot control. Real-world
conditions of uncertainty and noise can substantially degrade the performance of traditional
control methods. Sources of uncertainty and noise include (1) errors and noise in sensations,
(2) errors in execution of motion commands, and (3) uncertainty due to movement of the
part grasped by the robot. Under such conditions, traditional methods do not perform very
well, and the peg-insertion problem becomes a good candidate for adaptive methods. For
example, in the robot we used there is a large discrepancy between the sensed and actual
positions of the peg under an external load similar to what can occur during peg insertion:
whereas the actual change in the peg’s position under the external load was on the order of
2 to 3 mm, the largest sensed change in position was less than 0.025 mm. In comparison, the
clearance between the peg and the hole was 0.175 mm.

Although it is difficult to design a controller that can robustly perform peg insertions
despite the large uncertainty in sensory input, our results indicate that direct reinforcement
learning can be used to learn a reactive control strategy that works robustly even in the
presence of such a high degree of uncertainty. In a 2D version of the task (basically, inserting
a peg into a narrow slot) the controller was consistently able to perform successful insertions
within 100 time steps after about 150 learning trials. Furthermore, performance as measured
by insertion time continued to improve, decreasing continuously over learning trials. The
controller became progressively more skillful at peg insertion with training. Similar results
were obtained in a 3D task, although learning took somewhat more trials.

Our experiences with this problem helped develop the following perspective on an important
issue in control. The issue is when to approach a difficult control problem by first
attempting to construct an accurate model of the system being controlled, versus when to
attempt to solve the problem directly, i.e., without such a model. We argue that for some
problems constructing an adequate model is actually more difficult than solving the problem
itself. In robotics, it is a model of the task, e.g., a manipulation task, that is often
problematic,  not  a  model  of  the  robot  itself.  Adaptive  control  methods  appealing  directly  to 
the  demands  of  the  real  task  instead  of  to  a  model  of  the  task  can  be  very  effective  in  such 
problems. 


3  Navigation  and  Steering  Control 

Navigation and steering control problems provide useful test beds for exploring reinforcement
learning algorithms for sequential decision problems. The basic form of these problems
is  that  some  kind  of  “vehicle”  must  move  to  a  goal  region  of  its  environment  while  avoiding 
obstacles.  Learning  is  used  to  improve  the  vehicle’s  performance  with  successive  trials  in 
terms  of  the  distance  traveled,  the  time  required  to  reach  the  goal  region,  or  other  criteria. 
We  have  restricted  attention  to  problems  in  which  the  environment  is  static  in  that  it  does 
not  contain  moving  obstacles  or  other  vehicles.  By  learning  to  navigate  we  mean  learning 
the  direction  the  vehicle  should  move  from  each  location  in  order  to  reach  the  goal  region 
along  successively  better  paths.  By  learning  to  “steer,”  on  the  other  hand,  we  mean  learning 
to control a dynamic vehicle (for example, a vehicle that has mass and inertia), so that it
reaches the goal region via successively more efficient trajectories. Often we are only interested
in reaching the goal region in the minimum amount of time. Navigation and steering
control also apply to more abstract spaces, such as the configuration space of a robot manipulator,
instead of two- or three-dimensional Cartesian space. Many different versions of
these problems exist depending on the sensory and motor capabilities of the vehicle and on
the  structure  of  the  underlying  space. 

Although  navigation  and  steering  control  have  obvious  practical  applications,  we  have 
used abstract versions of these problems as tools for helping us understand and refine DP-based
reinforcement learning algorithms. However, our work is relevant to realistic examples
of these problems, and some of our recent research, as well as research in other groups,
involves experiments with these methods in actual navigation and steering control problems.


3.1  Navigation 

We developed a navigation test-bed simulating the movement of a cylindrical robot with a
sonar belt in a planar environment. This test-bed was first used to study short-range homing
in the presence of obstacles, that is, going to a “home” place from an arbitrary starting place
within a neighborhood of the home place. The simulated robot has 16 distance sensors and
16 grey-scale sensors evenly placed around its perimeter. Thus, the input to the learning
system at any time is a “sensation” vector of 32 real numbers representing its current view
of the environment. (Other versions of this test-bed used fewer simulated sensors.) This
contrasts with various “grid-world” navigation problems that we have studied in the past,
and  that  other  groups  are  studying,  in  which  the  robot  moves  from  square  to  square  in  a 
discretized  environment. 

This test-bed was used to illustrate the behavior of several DP-based learning architectures.
One architecture was developed by J. Bachrach [1; 2]. It takes a structured approach to
the  problem  and  utilizes  a  priori  knowledge  of  how  local  changes  in  position  tend  to  change 
the  robot’s  view.  The  homing  aspect  of  the  task  and  the  obstacle  avoidance  aspect  are 
handled by separate modules, implemented as “adaptive critics” that improve “evaluation
landscapes” with experience. An evaluation landscape in this case is a real-valued function
of  the  space  of  possible  sensations;  the  higher  the  value  of  a  sensation,  the  more  the  robot 
desires  to  be  there.  One  critic  learns  to  produce  a  gradually  sloping  evaluation  landscape 
with  a  maximum  at  the  home  place.  The  other  critic  learns  to  place  evaluation  minima 
around obstacles. Gradient ascent in the evaluation landscape formed by the superposition
of  the  landscapes  implemented  by  the  two  critics  produces  a  trajectory  that  both  avoids 
obstacles  and  moves  towards  home.  This  is  related  to  the  technique  of  potential  functions, 
but differs in that it is perceptually-based and involves learning. That is, the evaluation
landscape,  which  is  improved  through  experience,  only  evaluates  sensations  directly;  it  does 
not  directly  evaluate  places  in  space.  Places  indirectly  receive  evaluation  according  to  the 
sensations  that  the  robot  would  receive  if  it  moved  to  them.  Thus,  the  robot  does  not  have 
to  maintain  a  “bird's  eye”  view  of  the  environment.  This  navigation  control  architecture 
is described in Bachrach’s Ph.D. dissertation, completed in 1991. This work was our first
experience with using reinforcement learning in a control scheme that is “behavior-based”
in  the  sense  of  coordinating  several  different  behaviors  (homing  and  obstacle  avoidance). 
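
The superposition of two evaluation landscapes can be illustrated with a minimal sketch. Several simplifications are assumed here and are not taken from the architecture described above: both critics are fixed closed-form functions rather than learned ones, they evaluate positions directly instead of sensation vectors, and the home location, obstacle location, and dip shape are hypothetical. The robot climbs the combined landscape because higher evaluations mark more desirable places.

```python
import math

HOME = (10.0, 0.0)       # hypothetical home place
OBSTACLE = (5.0, 0.3)    # hypothetical obstacle centre


def home_critic(p):
    """Gently sloping landscape with its maximum at the home place."""
    return -math.dist(p, HOME)


def obstacle_critic(p):
    """Landscape with a minimum (a smooth dip) centred on the obstacle."""
    return -3.0 * math.exp(-math.dist(p, OBSTACLE) ** 2 / 2.0)


def evaluation(p):
    """Superposition of the two critics' landscapes."""
    return home_critic(p) + obstacle_critic(p)


def uphill_direction(p, eps=1e-4):
    """Unit vector along a central-difference estimate of the gradient."""
    gx = (evaluation((p[0] + eps, p[1])) - evaluation((p[0] - eps, p[1]))) / (2 * eps)
    gy = (evaluation((p[0], p[1] + eps)) - evaluation((p[0], p[1] - eps))) / (2 * eps)
    norm = math.hypot(gx, gy) or 1.0
    return gx / norm, gy / norm


def follow_landscape(start=(0.0, 0.0), step=0.1, max_steps=1000):
    """Climb the combined landscape in fixed-length steps until near home."""
    path = [start]
    p = start
    for _ in range(max_steps):
        if math.dist(p, HOME) < 0.2:
            break
        dx, dy = uphill_direction(p)
        p = (p[0] + step * dx, p[1] + step * dy)
        path.append(p)
    return path
```

Neither critic alone produces the detour: the home critic would drive the robot straight through the obstacle, and the obstacle critic has no notion of a destination. The trajectory that bends around the obstacle and then heads home comes from their sum.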

This test-bed was also used to illustrate a modular learning architecture developed by
S. Singh that learns several different homing/obstacle avoidance tasks in the same
environment. This is discussed below in the section on modular architectures.


3.2 Steering Control


To study steering control, we adopted the “race track problem,” where a starting line
and a finish line are given in a two-dimensional workspace, along with two curves connecting
corresponding edges of the starting and finish lines. The two curves represent the two side
walls of the race track, and the region enclosed by the walls and the starting and finish lines
is the admissible region of the workspace. As a “vehicle” we basically use a unit mass with
no damping or stiffness. The controller applies bounded forces at discrete time intervals
to the mass. The objective is to push it from the starting line to the finish line in minimum
time without hitting the walls. Hitting a wall at any point is considered a controller failure.
There are no constraints on the velocity at the finish line, so that any crossing of the finish
line is regarded as success. The difficulty of this problem can be adjusted by the selection
of  the  race  track  size  and  shape,  the  bound  on  controller  forces,  and  the  mass  of  the  vehicle. 
The  problem  can  be  made  stochastic  in  a  variety  of  ways. 
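
One deterministic instance of the race track problem can be sketched as below. The track shape (an L-shaped corridor), the force bound `F_MAX`, the time step `DT`, and the function names are assumptions chosen for illustration. The sketch also checks only the end point of each move against the walls, which is a simplification of "hitting a wall at any point."

```python
F_MAX = 1.0  # assumed bound on each force component
DT = 1.0     # assumed length of one control interval


def in_track(x, y):
    """Hypothetical L-shaped track: up one corridor, then right along another."""
    in_vertical = 0.0 <= x <= 2.0 and 0.0 <= y <= 10.0
    in_horizontal = 0.0 <= x <= 10.0 and 8.0 <= y <= 10.0
    return in_vertical or in_horizontal


def crossed_finish(x, y):
    """The finish line is the segment x = 10, 8 <= y <= 10."""
    return x >= 10.0 and 8.0 <= y <= 10.0


def clip(f):
    """Enforce the bound on controller forces."""
    return max(-F_MAX, min(F_MAX, f))


def step(state, force):
    """Advance the unit mass one interval; returns (new_state, status)."""
    x, y, vx, vy = state
    fx, fy = clip(force[0]), clip(force[1])
    vx, vy = vx + fx * DT, vy + fy * DT  # unit mass: acceleration equals force
    x, y = x + vx * DT, y + vy * DT
    if crossed_finish(x, y):
        status = "finished"
    elif in_track(x, y):
        status = "running"
    else:
        status = "crashed"
    return (x, y, vx, vy), status


def run(forces, start=(1.0, 0.0, 0.0, 0.0)):
    """Apply a fixed force sequence; returns (status, number of steps taken)."""
    state, status = start, "running"
    for t, force in enumerate(forces, start=1):
        state, status = step(state, force)
        if status != "running":
            return status, t
    return status, len(forces)
```

For example, `run([(0, 1)] * 3 + [(0, -1)] * 2 + [(1, -1)] + [(1, 0)] * 3)` accelerates up the first corridor, brakes before the corner, and then accelerates toward the finish line, whereas pushing upward without braking overshoots the corner and counts as a controller failure.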

We began with a version of the race track problem having a continuous state space.
The vehicle could occupy a continuum of places and move at an arbitrary velocity. On
a simple example of the race track problem (turning a single rectangular corner), our DP-based
learning scheme using radial basis functions was able to produce successively faster
times to the finish line by learning to take the corner along increasingly better trajectories, but
learning was very slow. Our research therefore went in two directions: 1) We used a finite-state
race track problem to compare our DP-based learning algorithms with the conventional
solution method (conventional DP). This version of the problem satisfies the conditions
required for a convergence theorem we proved [3]. 2) This problem cries out strongly for the
application of a modular architecture in which different modules are switched in for different
track configurations. This motivated the study of extending the modular architecture of Jacobs
[8; 9] to apply to this and similar problems, described below.
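
The contrast between conventional DP and an asynchronous variant on a finite-state problem can be sketched on a toy example. The six-cell, one-dimensional "track," the unit cost per time step, and the function names below are assumptions for illustration and are far simpler than the finite-state race track used in the experiments. Conventional DP backs up every state in every sweep, while the asynchronous version backs up one state at a time, in place and in arbitrary order.

```python
import random

N_STATES = 6          # hypothetical finite track: cells 0..5, finish at cell 5
FINISH = N_STATES - 1
ACTIONS = (-1, +1)    # move one cell back or forward


def next_state(s, a):
    """Deterministic move, blocked at the two ends of the track."""
    return min(max(s + a, 0), FINISH)


def backup(values, s):
    """Bellman backup for one state: one time step plus the best successor cost."""
    return 1.0 + min(values[next_state(s, a)] for a in ACTIONS)


def conventional_dp(sweeps=50):
    """Synchronous value iteration: every state is backed up in every sweep."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        values = [0.0 if s == FINISH else backup(values, s) for s in range(N_STATES)]
    return values


def asynchronous_dp(updates=2000, seed=0):
    """Back up one state at a time, in place, in an arbitrary (random) order."""
    rng = random.Random(seed)
    values = [0.0] * N_STATES
    for _ in range(updates):
        s = rng.randrange(FINISH)  # any non-finish state
        values[s] = backup(values, s)
    return values
```

Both procedures arrive at the same minimum-time-to-finish values. The asynchronous form is the one that lends itself to learning, since the states to back up can be the ones the controller actually visits.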


4  Modular  Architectures 

Work on a modular network architecture was begun under the previous AFOSR grant.
This work was completed in the period being reported and formed the basis of the Ph.D.
dissertation of R. A. Jacobs. This is a method for improving the learning ability of artificial
neural networks by organizing several networks into a modular structure [8; 9]. One
advantage of such a structure is that the individual networks are not faced with solving large
problems in their entirety. Large problems are solved by the combined efforts of several
networks. The learning method is a generalization of the unsupervised learning method of
competitive learning to the supervised case. After Jacobs was awarded the Ph.D. in May
1990, he worked as a postdoctoral researcher at MIT under the direction of Michael Jordan
before  taking  his  current  position  as  Assistant  Professor  of  Psychology  at  the  University 
of  Rochester.  This  work  has  been  very  influential  in  the  neural  network  community,  and 
current  work  of  Jacobs  and  Jordan  continues  to  develop  this  basic  idea  with  considerable 
success. 

Whereas Jacobs' architecture is for supervised learning, our own research with modular
architectures extended Jacobs' ideas to a modular architecture for reinforcement learning.
The idea was to develop a learning architecture which would facilitate transfer of learning
among multiple sequential decision tasks. This is important because sophisticated autonomous
agents will have to learn to solve many different tasks, not just one; they should
learn throughout their “lives.” While achieving transfer of learning across an arbitrary set
of tasks is difficult, or even impossible, there are useful and general classes of tasks where
such transfer is achievable. We focused on extending DP-based reinforcement learning algorithms
to compositionally structured sets of sequential decision tasks. Specifically, we
studied learning agents that have to learn to solve a set of sequential decision tasks, where
the more complex tasks, called composite tasks, are formed by temporally concatenating several
simpler, or elemental, tasks. Learning occurred under the assumption that a composite
task's decomposition into a sequence of elemental tasks was unknown to the learning agent.

Our architecture, called CQ-L, performs compositional Q-learning, where Q-learning is
a DP-based reinforcement learning method proposed by Watkins [15; 16]. It is a kind of
Monte Carlo DP method for estimating the value of performing various actions when the
environment is in various states. These values are stored in a function called the Q-function
of the task. CQ-L consists of several Q-learning modules, a gating module, and a bias
module. In different simulations these modules were variously implemented as lookup tables
or as radial basis networks. When trained on a set of compositionally-structured sequential
decision tasks, CQ-L is able to do the following: 1) learn the Q-functions of the elemental
tasks in separate Q-learning modules; 2) determine the decomposition of the composite
tasks in terms of the elemental tasks; 3) learn to construct the Q-functions of the composite
tasks by temporally concatenating the Q-functions of the elemental tasks; and 4) learn the
constant biases that are added to the Q-value functions of the elemental tasks to construct
the Q-value function of the composite tasks.
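
Two of the ingredients above, a lookup-table Q-learning module and the addition of a constant bias to an elemental Q-function, can be sketched on a toy task. The five-state chain, the step cost, the learning rate, and the function names are assumptions for illustration; the gating module and the learning of the bias itself are not shown.

```python
import random

N_STATES = 5                 # hypothetical chain of states 0..4; state 4 is the goal
ACTIONS = ("left", "right")


def move(state, action):
    """Deterministic chain dynamics with a cost of 1 per step (reward -1)."""
    nxt = max(state - 1, 0) if action == "left" else state + 1
    return nxt, -1.0


def q_learning(episodes=500, alpha=0.1, gamma=1.0, seed=0):
    """Tabular Q-learning from sampled transitions under random exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    goal = N_STATES - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.choice(ACTIONS)
            s_next, r = move(s, a)
            best_next = 0.0 if s_next == goal else max(q[(s_next, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s_next
    return q


def composite_q(q_elemental, bias):
    """Assumed CQ-L-style composition: elemental Q-values plus a constant bias."""
    return {key: value + bias for key, value in q_elemental.items()}


def greedy(q, state):
    """Action with the highest Q-value in the given state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])
```

Because the bias is a constant, the composite Q-function ranks actions exactly as the elemental one does. That is why a solution learned for an elemental task can be reused unchanged inside a composite task, with the bias accounting only for what remains of the composite task.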

Simulations  using  the  navigation  testbed  described  above  showed  that  CQ-L  is  able  to 
learn tasks complex enough to evade solution via a conventional DP-based learning architecture.
CQ-L is more powerful than the conventional architecture because it uses solutions of
the  elemental  tasks  as  building  blocks  for  solving  the  composite  tasks.  Transfer  of  learning  is 
achieved  by  sharing  the  elemental  task  solutions  across  several  composite  tasks.  This  is  work 
of  S.  P.  Singh,  a  research  assistant  who  has  been  funded  by  this  grant.  Singh  has  published 
several  papers  on  his  work  [14;  12;  13]  and  is  expected  to  complete  the  Ph.D.  degree  in 
the  summer  of  1993.  Singh's  work  has  already  been  influential  in  the  AI  Machine  Learning 
research community, where increasing attention is being devoted to DP-based reinforcement
learning  as  a  component  of  intelligent  agents. 


5  Abstract  Actions 

Closely related to our work with modular architectures is our study of DP-based learning
with abstract actions. Most applications of DP-based learning described in the literature
use these methods at a very low level. For example, the learning component's actions may
be primitive movements in a navigation problem. This low level of abstraction generally
produces very difficult tasks that can be learned only very slowly. Part of our research
effort has been directed toward raising the level of abstraction at which DP-based learning
algorithms are applied. One way to do this is by letting the learning component's actions be
control signals to other system components instead of low-level overt actions in the system's
environment. This is one way to incorporate prior knowledge into a learning system in order
to improve its performance, and it addresses the problem of having the system perform
acceptably while it is learning: if a learning system is to learn from its failures, how can
one prevent these failures from producing inconvenient, expensive, or catastrophic results?
This issue, perhaps more than any other, has limited the utility of DP-based reinforcement
learning  in  many  real-world  applications.  One  answer  is  to  use  reinforcement  learning  as  a 
component  of  a  more  complex  system. 

We experimented with a kind of “behavior-based” reinforcement learning in which the
learning component's task is to learn how to coordinate a repertoire of behaviors that have
been hand-crafted to 1) achieve desired goals, and 2) avoid catastrophic failure. Learning
the right way to compose these behaviors in a state-dependent manner can improve the
system's behavior toward optimality while it is operating adequately. We are currently
applying these ideas to the navigation domain. The abstract actions correspond to two
navigation functions that are computed by using the harmonic function approach to path
planning recently developed by Connolly and Grupen, colleagues doing robotics research at
the University of Massachusetts.

In harmonic function path planning, navigation functions are obtained as solutions of
Laplace’s equation (an elliptic partial differential equation) over the relevant robot configuration
space. A navigation function is a function with the property that a robot following
its gradient from any point in space is guaranteed to reach the goal configuration while
avoiding all obstacles. Different boundary conditions of Laplace's equation produce different
navigation functions. One such function (obtained using Dirichlet boundary conditions)
tends to repel the robot directly away from obstacles while attracting it to the goal. Another
navigation function (obtained using Neumann boundary conditions) tends to make the robot “hug” the
obstacle boundaries while attracting it to the goal.
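
A navigation function of the Dirichlet kind can be sketched on a small grid. This is an illustration under stated assumptions, not the Connolly and Grupen implementation: the map, the boundary values (1 on obstacles and walls, 0 at the goal), the Gauss-Seidel relaxation, and the four-neighbour descent rule are all choices made for the example, and the Neumann variant is not shown.

```python
GRID = [
    "#########",
    "#...#...#",
    "#...#...#",
    "#...#.G.#",
    "#.......#",
    "#########",
]  # hypothetical map: '#' obstacle or wall, 'G' goal, '.' free space


def solve_laplace(grid, sweeps=2000):
    """Gauss-Seidel relaxation with Dirichlet values: 1 on obstacles, 0 at the goal."""
    phi = [[0.0 if cell == "G" else 1.0 for cell in row] for row in grid]
    for _ in range(sweeps):
        for r, row in enumerate(grid):
            for c, cell in enumerate(row):
                if cell == ".":
                    # Discrete Laplace equation: each free cell is the mean
                    # of its four neighbours.
                    phi[r][c] = 0.25 * (phi[r - 1][c] + phi[r + 1][c]
                                        + phi[r][c - 1] + phi[r][c + 1])
    return phi


def descend(grid, phi, start):
    """Step to the lowest-valued neighbour until the goal is reached."""
    path = [start]
    r, c = start
    while grid[r][c] != "G":
        neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        nr, nc = min(neighbours, key=lambda rc: phi[rc[0]][rc[1]])
        if phi[nr][nc] >= phi[r][c]:
            break  # no downhill neighbour; not expected for a converged solution
        r, c = nr, nc
        path.append((r, c))
    return path
```

Calling `descend(GRID, solve_laplace(GRID), (1, 1))` yields a path that leaves the left chamber, passes under the dividing wall, and ends at the goal. Each free cell equals the average of its neighbours, so the relaxed function has no spurious local minima away from the goal, which is the property that makes it usable as a navigation function.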

We experimented with using DP-based learning to adjust how these functions were combined
to produce another navigation function enabling the robot to reach the goal much faster
than  it  could  using  either  function  alone.  This  can  be  done  in  such  a  way  that  throughout 
repeated  learning  trials,  the  robot  always  reaches  its  goal  and  never  hits  an  obstacle.  Thus 
learning can occur on-line while the robot is actually performing its designated task without
risking inadequate performance. Reinforcement learning is used for perfecting skilled
performance, not for achieving adequate performance. We think that reinforcement learning
will be most useful in this capacity. We produced successful demonstrations of these ideas
in simulated environments, and we are currently applying them to an actual GE P-50 robot
arm.


6 Theory


We have made considerable progress in increasing our theoretical understanding of DP-based
reinforcement learning methods and how they relate to other methods. We wrote an
extensive paper ([3], still under review for Artificial Intelligence Journal) that relates these
learning algorithms to the theory of asynchronous DP [4] and to the heuristic search method
called Learning Real-Time A* [10]. This resulted in a convergence theorem for a class of
DP-based algorithms and clearly articulates the advantages they offer over conventional
methods for some types of problems. We have also begun development of theory in which
some versions of DP-based learning algorithms can be derived as Robbins-Monro types of
stochastic approximation methods for solving the Bellman optimality equation. We are
currently  studying  the  stochastic  approximation  literature  to  derive  asymptotic  convergence 
results  as  well  as  rate  of  convergence  results. 
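
For orientation, the connection between the Bellman optimality equation and a Robbins-Monro iteration can be written out in standard textbook notation. The symbols below are assumptions introduced here and do not come from the report: states $i, j$, actions $a, b$, immediate cost $c(i,a)$, transition probabilities $p_{ij}(a)$, discount factor $\gamma$, and step sizes $\alpha_k$.

```latex
\begin{align*}
% Bellman optimality equation (cost-minimizing form)
V^*(i) &= \min_{a} \Big[ c(i,a) + \gamma \sum_{j} p_{ij}(a)\, V^*(j) \Big] \\
% Q-learning step after observing one sampled transition from i to j under action a
Q_{k+1}(i,a) &= Q_k(i,a) + \alpha_k \Big[ c(i,a) + \gamma \min_{b} Q_k(j,b) - Q_k(i,a) \Big] \\
% Robbins-Monro conditions on the step sizes
\sum_{k} \alpha_k &= \infty, \qquad \sum_{k} \alpha_k^2 < \infty
\end{align*}
```

The bracketed term in the second line is a noisy sample of the Bellman residual. The sum over successor states is replaced by the single successor actually observed, which is what places the update in the stochastic approximation framework.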


7  Conclusion 


The period covered by this grant has seen a remarkable increase in the number of researchers
studying DP-based reinforcement learning. This is due in part to increased interest
in  the  study  of  embedded  autonomous  agents.  Learning  is  being  widely  recognized  as  an 
essential capability of such agents, and DP-based reinforcement learning is directly applicable
to the kinds of problems such agents face. Our research funded by this and other
grants,  as  well  as  the  research  conducted  at  other  laboratories,  is  quickly  moving  these 
methods  toward  becoming  standard  tools  that  can  be  successfully  applied  to  a  wide  range 
of problems. While the theory of these algorithms is still underdeveloped, we now have a
much  clearer  idea  of  how  they  are  related  to  more  traditional  methods  of  decision  theory 
and  control.  We  are  convinced  that  DP-based  reinforcement  learning,  in  all  of  its  varieties, 
is a collection of novel algorithms that will find increasing use in forming useful approximate
solutions  to  stochastic  sequential  decision  problems  of  practical  importance. 


COINS Technical Report 92-10, University of Massachusetts, Amherst, January 1992.

External  honors,  etc. 

Andrew  G.  Barto  became  a  Senior  Fellow  of  IEEE. 

Andrew G. Barto gave an invited plenary address entitled “Learning to Act: A Perspective
from Control Theory” at the Tenth Annual Meeting of the American
Association for Artificial Intelligence (AAAI-92) at San Jose, CA, July 15, 1992.

Andrew  G.  Barto  gave  the  invited  plenary  lecture,  entitled  “Reinforcement  Learning,” 
at  the  1992  Conference  on  Learning  Theory  at  the  University  of  Pittsburgh,  July 
27,  1992. 